Abstract

In this paper, we describe a new real-time visual system that enables a humanoid robot to learn from and interact with
humans. The core of the visual system is a probabilistic tracker that uses shape and color information to find relevant
objects in the scene. Multiscale representations, windowing and masking are employed to accelerate the data processing. The
perception system is directly coupled with the motor control system of our humanoid robot DB. We present two case studies
of on-line interaction with a humanoid robot: mimicking of human hand motion and smooth pursuit of human head motion.
The generation of humanoid robot motion based on the position of relevant body parts is accomplished in real time. Both
studies are supported by experimental results on DB. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Real-time visual tracking; Humanoid robots; Mimicking; Smooth pursuit

1. Introduction

We are currently investigating ways to program and interact with a humanoid robot. Movement
imitation or mimicking and higher forms of learning from demonstrations have been identified as
a useful tool for programming such robots [1,11]. To learn from demonstrations and to interact
with humans, the humanoid must be able to perceive human motion. While off-line processing of
visual data is sometimes acceptable for learning from demonstration [14,15], a real-time
perceptual system is essential for interaction tasks. Once motion perception is seen as a
continuous process that interacts with the motor system, the required standards of reliability
become much more stringent because failure in just one image frame might cause the entire
system to break down.

Our humanoid robot DB (see Figs. 3 and 5) has 30 degrees of freedom: seven for each arm, three
for each leg, two for each eye, three for the head and three for the torso. Each eye of the
robot's oculomotor system consists of two cameras, a wide-angle (100 degrees view angle
horizontally) color camera for peripheral vision, and a second narrow view camera (24 degrees
view angle horizontally) providing a color image for foveal vision. This setup mimics the
foveated retinal structure of primates. Such a setup is essential for an artificial vision
system in order to obtain high resolution images of objects of interest while still being able
to perceive events in the peripheral environment. The images from the wide-angle cameras are
captured and processed by standard PCs running the Windows NT operating system. The extracted
data is sent via serial connections to a Power PC processor that generates data needed by a
motor control system.

The key issue when realizing a real-time motion perception system is to avoid excessive
interaction between the pieces of data in both the time and spatial domains. A practical system
for perceiving human motion should be able to deal with complex environments and at least
moderately changing lighting conditions. Probabilistic approaches are the prime candidates to
explore because they allow us to prevent excessive data interaction through independence
assumptions and because continuous probabilities associated with image pixels prevent the
perceptual algorithms from being brittle with respect to variations in the background and
lighting conditions. By putting the motion tracking and estimation problems in a Bayesian
setting, we can utilize a maximum likelihood approach to find the relevant objects and to
recover the observed motion.
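This should enable reliable motion tracking once the real-time process is allowed to run.

2. Motion perception system

The goal of our real-time visual system is to perceive the motion of body parts such as hands
and head as well as objects that the observed person is manipulating. The most important part
of the system is a real-time tracker, which we present in this section. We also consider some
important issues such as 2D shape estimation, occlusion reasoning, and 3D estimation of
positions of the tracked entities using stereo.

2.1. Probabilistic framework

We represent the observed environment by a number of random processes. Each entity to be
tracked is represented by one process. Let us denote the probability that a pixel positioned at
$\mathbf{u} = (u, v)$ having color intensity $I_{\mathbf{u}}$ was generated by the process
$\Theta_k$, $k = 1, \ldots, K$, by $P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta_k)$. We also
introduce two additional processes: the optional background process $\Theta_{K+1}$, which
describes the stationary background (useful only for fixed cameras), and the outlier process
$\Theta_0$, which models the data not captured by other processes. Assuming that every pixel
stems from one of the mutually independent processes $\Theta_k$, $k = 0, \ldots, K+1$
(closed-world assumption), we can write the probability that color $I_{\mathbf{u}}$ was observed
at location $\mathbf{u}$ using the total probability law

\[ P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta) = \sum_{k=0}^{K+1} \omega_k P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta_k), \tag{1} \]

where $\omega_k$ is the prior (mixture) probability to observe the process $\Theta_k$,
$\sum_{k=0}^{K+1} \omega_k = 1$, and $\Theta = \{\Theta_0, \Theta_1, \ldots, \Theta_{K+1}\}$.
Finally, neglecting the correlation of assigning neighboring pixels to processes, we can
evaluate the overall probability to observe the image $I$:

\[ P(I) = P(I \mid \Theta) = \prod_{\mathbf{u}} P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta). \tag{2} \]

At each time step, we would like to determine $\Theta_1, \ldots, \Theta_K, \omega_0, \omega_1,
\ldots, \omega_{K+1}$ so that the likelihood (2) is maximized. Instead of maximizing (2), it is
often easier to minimize the negative log likelihood

\[ L(\Theta, \omega) = -\log(P(I \mid \Theta)) = -\sum_{\mathbf{u}} \log(P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta)), \tag{3} \]

where $\omega = (\omega_0, \ldots, \omega_{K+1})$. Taking into account that mixture
probabilities should add up to 1, the corresponding Lagrangian function is given by

\[ L(\Theta, \omega, \lambda) = -\sum_{\mathbf{u}} \log(P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta)) + \lambda \left( \sum_{k=0}^{K+1} \omega_k - 1 \right). \tag{4} \]

At a local extremum, we have

\[ 0 = \frac{\partial}{\partial \Theta_l} L(\Theta, \omega, \lambda)
     = -\sum_{\mathbf{u}} \frac{\omega_l (\partial / \partial \Theta_l) P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta_l)}{\sum_{k=0}^{K+1} \omega_k P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta_k)}
     = -\sum_{\mathbf{u}} p_{\mathbf{u},l} \frac{\partial}{\partial \Theta_l} \log(P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta_l)), \tag{5} \]

where $l = 1, \ldots, K$, and $p_{\mathbf{u},l}$ is the probability that pixel $\mathbf{u}$
stems from the $l$th process

\[ p_{\mathbf{u},l} = \frac{\omega_l P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta_l)}{\sum_{k=0}^{K+1} \omega_k P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta_k)}. \tag{6} \]

Similarly, we obtain

\[ 0 = \frac{\partial}{\partial \lambda} L(\Theta, \omega, \lambda) = \sum_{k=0}^{K+1} \omega_k - 1, \tag{7} \]

\[ 0 = \frac{\partial}{\partial \omega_l} L(\Theta, \omega, \lambda) = -\frac{\sum_{\mathbf{u}} p_{\mathbf{u},l}}{\omega_l} + \lambda. \tag{8} \]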

The parameters describing the observed environment should be calculated by solving Eqs. (5),
(7) and (8). These equations can only be solved iteratively. A good iterative approach for this
problem is provided by the EM algorithm, in which this is done by first calculating the
probabilities $p_{\mathbf{u},l}$ using the current estimate for $\Theta$ and $\omega$ (the
expectation step) and then solving Eqs. (5), (7) and (8) as if $p_{\mathbf{u},l}$ were constants
independent of $\Theta$ (the maximization step). This process is repeated until convergence.

The expectation step consists of calculating the probabilities (6) and is theoretically
trivial, although it takes most of the processing time in practice. In the following, we
concentrate on the maximization step, i.e. the calculation of the unknown parameters given the
probabilities $p_{\mathbf{u},l}$. $\lambda, \omega_0, \ldots, \omega_{K+1}$ can be calculated
directly regardless of the choice of probability distribution for $\Theta$. Taking into account
that $\sum_{l=0}^{K+1} p_{\mathbf{u},l} = 1$ (see Eq. (6)), the solution to Eqs. (7) and (8)
turns out to be $\lambda = N$, where $N$ is equal to the number of pixels, and

\[ \omega_l = \frac{1}{N} \sum_{\mathbf{u}} p_{\mathbf{u},l}. \tag{9} \]

Intuitively, $\omega_l$ is proportional to the percentage of pixels stemming from the process
$\Theta_l$.
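
As an illustration only, one EM iteration restricted to the mixture probabilities can be sketched in Python as follows. The sketch assumes that the likelihoods $P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta_l)$ have already been evaluated for every pixel; the function names `e_step` and `update_weights` and the toy likelihood values are hypothetical and are not taken from the described implementation.

```python
def e_step(weights, likelihoods):
    """Eq. (6): responsibilities p[u][l] for every pixel u and process l.

    weights[l]        -- current mixture probability omega_l
    likelihoods[u][l] -- P(I_u, u | Theta_l), assumed to be precomputed
    """
    resp = []
    for pixel_lik in likelihoods:
        joint = [w * p for w, p in zip(weights, pixel_lik)]
        total = sum(joint)  # assumed non-zero: some process explains the pixel
        resp.append([j / total for j in joint])
    return resp


def update_weights(resp):
    """Eq. (9): omega_l = (1/N) * sum over pixels of p[u][l]."""
    n_pixels = len(resp)
    n_processes = len(resp[0])
    return [sum(row[l] for row in resp) / n_pixels for l in range(n_processes)]


# Toy data: 4 pixels, 2 processes (hypothetical likelihood values).
likelihoods = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.5, 0.5]]
weights = [0.5, 0.5]
for _ in range(5):
    resp = e_step(weights, likelihoods)
    weights = update_weights(resp)
```

Because every row of responsibilities sums to one, the updated mixture probabilities sum to one as well, which is the constraint enforced by the Lagrange multiplier in Eq. (4).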

To calculate the rest of the parameters, we must first decide how to model the process
distributions $\Theta_k$. Researchers have used various features when modeling images by
mixture models such as, e.g., intensity variations [6], color [2,9], optical flow combined with
the spatial coherence [3], and 3D ellipsoidal models [7]. The distribution of these features is
usually modeled as Gaussian, which significantly simplifies the calculation of logarithms of
probabilities in Eq. (5).

Our approach uses shape and color mixtures (or sometimes color intensity mixtures) to evaluate
the probability that a pixel belongs to a certain process. Assuming that these properties are
independent of each other, we can write

\[ P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta_l) = P(I_{\mathbf{u}} \mid \Theta_l) P(\mathbf{u} \mid \Theta_l). \tag{10} \]

In many cases, for example, when tracking body parts, the 2D shape of the tracked objects is
roughly ellipsoidal and we can estimate it by the center of the object's image $\mathbf{x}_l$
and by the covariance matrix $\Sigma_l$ of pixels contained in it. The shape part of the
probability that a pixel $\mathbf{u}$ belongs to a blob can then be estimated as

\[ P(\mathbf{u} \mid \Theta_l) = \frac{1}{2\pi \sqrt{\det(\Sigma_l)}} \exp\left( -\frac{1}{2} (\mathbf{u} - \mathbf{x}_l)^T \Sigma_l^{-1} (\mathbf{u} - \mathbf{x}_l) \right). \tag{11} \]



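Assuming that the object's texture consists of a finite number of colors, we can model the
color probabilities by a Gaussian mixture model

\[ P(I_{\mathbf{u}} \mid \Theta_l) = \sum_{k} \omega_{l,k} \, g(I_{\mathbf{u}} \mid \bar{I}_{l,k}, \Gamma_{l,k}), \tag{12} \]

where $\sum_{k} \omega_{l,k} = 1$ and

\[ g(I_{\mathbf{u}} \mid \bar{I}_{l,k}, \Gamma_{l,k}) = \frac{1}{\sqrt{(2\pi)^{d} \det(\Gamma_{l,k})}} \exp\left( -\frac{1}{2} (I_{\mathbf{u}} - \bar{I}_{l,k})^T \Gamma_{l,k}^{-1} (I_{\mathbf{u}} - \bar{I}_{l,k}) \right). \tag{13} \]

We experimented both with colors and color intensities, therefore the exponent $d$ in Eq. (13)
is 2 or 3 depending on the dimension of $I_{\mathbf{u}}$.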

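The adaptation of colors $\bar{I}_{l,k}$ and their covariances $\Gamma_{l,k}$ within the EM
algorithm makes the tracking unstable, therefore we keep them constant. The necessary
parameters are determined in an off-line initialization phase. This means that

\[ \frac{\partial}{\partial \Theta_l} \log(P(I_{\mathbf{u}}, \mathbf{u} \mid \Theta_l)) = \frac{\partial}{\partial \Theta_l} \log(P(\mathbf{u} \mid \Theta_l)). \tag{14} \]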

The parameters to be estimated are the objects' positions $\mathbf{x}_l$ and covariances
$\Sigma_l$ and sometimes the color mixture probabilities $\omega_{l,k}$. Writing

\[ P_{\mathbf{u},l,k} = \omega_{l,k} \, g(I_{\mathbf{u}} \mid \bar{I}_{l,k}, \Gamma_{l,k}), \tag{15} \]

we can transform Eq. (5) into

It is well known that these equations can be solved by computing the weighted mean and
covariances of image pixels with $p_{\mathbf{u},l}$ being used as weights.
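
The weighted mean and covariance mentioned above can be sketched as follows. This is a minimal Python illustration under the assumption that the weights $p_{\mathbf{u},l}$ of one blob are available as a flat list aligned with the pixel coordinates; the function name `weighted_mean_cov` and the toy data are hypothetical.

```python
def weighted_mean_cov(pixels, resp):
    """Weighted mean and 2x2 covariance of pixel positions for one blob.

    pixels -- list of (u, v) image coordinates
    resp   -- list of weights p_{u,l} for the blob, same length as pixels
    """
    total = sum(resp)  # assumed positive: the blob explains at least one pixel
    mean_u = sum(w * u for (u, _), w in zip(pixels, resp)) / total
    mean_v = sum(w * v for (_, v), w in zip(pixels, resp)) / total
    s_uu = sum(w * (u - mean_u) ** 2 for (u, _), w in zip(pixels, resp)) / total
    s_vv = sum(w * (v - mean_v) ** 2 for (_, v), w in zip(pixels, resp)) / total
    s_uv = sum(w * (u - mean_u) * (v - mean_v)
               for (u, v), w in zip(pixels, resp)) / total
    return (mean_u, mean_v), ((s_uu, s_uv), (s_uv, s_vv))


# Four equally weighted pixels at the corners of a square.
pixels = [(0, 0), (2, 0), (0, 2), (2, 2)]
resp = [1.0, 1.0, 1.0, 1.0]
center, cov = weighted_mean_cov(pixels, resp)
```

Setting the derivative of the Gaussian log-probability (11) with respect to $\mathbf{x}_l$ to zero in Eq. (5) gives exactly this weighted mean, and the covariance follows in the same way.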

The reasoning when estimating the color mixture probabilities $\omega_{l,k}$ is similar to the
case of the estimation of blob mixture probabilities $\omega_l$. The result is

where $\omega_l$ are the newly calculated blob mixture probabilities. This completes the basic
algorithm used by our tracker.

2.2. Real-time considerations

The resolution of color images, which are captured by our system at 30 Hz, is 320 × 240. This
means that we need to process ca. 6.6 MB of data every second (320 × 240 pixels × 3 bytes × 30
frames, assuming 24-bit color). One way to reduce the computation time would be to model
properties of the tracked entities by a simpler probability distribution than the normal
distribution. This was done in [5], although in a somewhat different setting. We decided rather
to stay with the normal distribution and to reduce the computation time by narrowing the areas
in the image in which the probabilities need to be evaluated.

The regions of interest are first determined by windows located at the previous or predicted
position of each tracked object. This works fine for compact objects whose minor and major axes
do not differ too much. But a window is only a poor approximation for elongated objects such as
sticks in Fig. 1. Therefore, we also generate an ellipsoidal mask around the tracked object.
The mask is specified by a binary image having 1 at pixels where the probabilities need to be
evaluated.
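
A binary ellipsoidal mask of this kind could be generated as in the following Python sketch. It assumes that the blob is described by its center and 2 × 2 pixel covariance, and that the blob boundary corresponds to a squared Mahalanobis distance of 4 (the value implied by semi-axes of twice the square root of the covariance eigenvalues); the function name `elliptical_mask` and the `scale` parameter are hypothetical.

```python
def elliptical_mask(width, height, center, cov, scale=1.75):
    """Binary mask (list of rows) that is 1 inside the enlarged blob ellipse.

    center -- (cx, cy) blob center in pixels
    cov    -- ((sxx, sxy), (sxy, syy)) covariance of the blob pixels
    scale  -- enlargement factor applied to the ellipse axes (assumed parameter)
    """
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    # Entries of the inverse of the symmetric 2x2 covariance matrix.
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det
    cx, cy = center
    limit = 4.0 * scale * scale  # squared Mahalanobis radius of the enlarged ellipse
    mask = []
    for v in range(height):
        row = []
        for u in range(width):
            du, dv = u - cx, v - cy
            d2 = ixx * du * du + 2.0 * ixy * du * dv + iyy * dv * dv
            row.append(1 if d2 <= limit else 0)
        mask.append(row)
    return mask


# Axis-aligned blob with semi-axes 4 (horizontal) and 2 (vertical), no enlargement.
mask = elliptical_mask(20, 20, center=(10, 10),
                       cov=((4.0, 0.0), (0.0, 1.0)), scale=1.0)
```

The probabilities then only need to be evaluated at pixels where the mask is 1, which is what makes the mask cheaper than a rectangular window for elongated objects.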

To further reduce the computation time, we process images at two different resolutions. First
we run the EM algorithm on the reduced resolution image, i.e. 160 × 120. In our experiments,
the window and the mask size are typically set to be 1.75 to 2 times larger than the minimal
bounding box containing the object. The initial position is taken to be the previous or the
predicted position, but the initial object size is set to be 1.5 to 2 times larger than the
object in the previous image so that also rapidly moving objects can be tracked. After one or
two iteration steps, we increase the resolution to the full resolution, but with the window and
mask size reduced to only 1.25 times the initial object size. The initial position and shape in
this second iteration are taken to be the position and shape estimated at the lower resolution.
Again, only one or two steps of the EM algorithm are performed. For very big objects, we do not
carry out the full resolution iteration at all because their position and shape can be
determined reliably already in the lower resolution image and because the processing of big
objects becomes very expensive at the full resolution.

We made use of the free Intel Image Processing Library, which is available from
http://developer.intel.com/software/products/perflib/, to implement our system. The library is
optimized for various Intel Pentium processors and is effective at taking advantage of the MMX
technology. It also includes support for windowing and masking, thus making it suitable for the
development of real-time vision systems like ours.

2.3. Shape estimation

The size (the major and the minor axes $a$ and $b$) and the orientation ($\theta$) of the blob
$B$ can be calculated using the relationship between these parameters and the covariance of the
pixels within the blob. In the ideal case, we can make a crisp decision whether a pixel belongs
to a blob or not. The covariance of the pixels is then given by

\[ \Sigma = \frac{1}{\pi a b} \iint_{B} \begin{bmatrix} (x - x_0)^2 & (x - x_0)(y - y_0) \\ (x - x_0)(y - y_0) & (y - y_0)^2 \end{bmatrix} dx\, dy
   = R \left( \frac{1}{\pi a b} \iint_{x^2/a^2 + y^2/b^2 \le 1} \begin{bmatrix} x^2 & xy \\ xy & y^2 \end{bmatrix} dx\, dy \right) R^T
   = R \begin{bmatrix} a^2/4 & 0 \\ 0 & b^2/4 \end{bmatrix} R^T, \]

where $(x_0, y_0)$ is the center of the ellipse and $R$ the rotation matrix aligning the
ellipse with the coordinate axes. It follows that we can estimate the size and the orientation
of the tracked object by solving the eigenvalue problem for the estimated covariance matrix.
The lengths of the major and the minor axes are given by $a = 2\sqrt{\lambda_1}$ and
$b = 2\sqrt{\lambda_2}$, where $\lambda_1$ and $\lambda_2$ are the larger and the smaller
eigenvalue of the estimated covariance matrix, respectively. The rotation matrix $R$ (and from
it the angle $\theta$) is given by a matrix with the corresponding eigenvectors in its columns.
These parameters can be used to generate masks for efficient processing and to draw results
into the captured images. This is how the result images shown in this paper were generated.
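
For a 2 × 2 covariance matrix the eigenvalue problem has a closed-form solution, so the shape parameters can be computed without a general eigensolver. The following Python sketch illustrates this; the function name `ellipse_from_covariance` is hypothetical, and the orientation is returned as the angle of the major axis measured from the horizontal image axis, which is an assumed convention.

```python
import math


def ellipse_from_covariance(cov):
    """Return (a, b, theta) for a symmetric covariance ((sxx, sxy), (sxy, syy)).

    a, b  -- major and minor axes, a = 2*sqrt(lambda_1), b = 2*sqrt(lambda_2)
    theta -- angle of the major axis (eigenvector of lambda_1), in radians
    """
    (sxx, sxy), (_, syy) = cov
    mean = 0.5 * (sxx + syy)
    spread = math.hypot(0.5 * (sxx - syy), sxy)
    lam1 = mean + spread            # larger eigenvalue
    lam2 = max(mean - spread, 0.0)  # smaller eigenvalue, clamped against round-off
    a = 2.0 * math.sqrt(lam1)
    b = 2.0 * math.sqrt(lam2)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return a, b, theta


# Covariance with eigenvalues 4 and 1 whose major axis points along (1, 1).
a, b, theta = ellipse_from_covariance(((2.5, 1.5), (1.5, 2.5)))
```

In this example the eigenvalues are 4 and 1, so the axes are 4 and 2 and the major axis is rotated by 45 degrees.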
2.4. Occlusion reasoning

Occlusions are common in real environments, especially when observing humans manipulating
objects, such as in Fig. 1. Occlusion reasoning usually requires a prediction step, as for
example, in [8], on the basis of which we can determine which objects will overlap in the next
image frame and where.

In general, we have no information about the physics of motion of the observed objects,
therefore we predict each parameter using a discrete second-order dynamical system

\[ x(t) = a\, x(t-1) + b\, x(t-2) + e(t), \tag{19} \]

where $x(t)$ is one of the parameters describing the tracked object (position, orientation,
shape) and $e(t)$ is the system noise, both given at time $t$. The unknown parameters $a$ and
$b$ are estimated using recursive least-squares with a forgetting factor.
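
A minimal Python sketch of such a predictor is given below. It implements the textbook recursive least-squares update with exponential forgetting for the two parameters of Eq. (19); the class name `SecondOrderPredictor`, the forgetting factor 0.95, the initial covariance and the initial parameter guess are all assumptions chosen for illustration, not values from the described system.

```python
class SecondOrderPredictor:
    """Fits x(t) = a*x(t-1) + b*x(t-2) + e(t) by recursive least squares."""

    def __init__(self, forgetting=0.95, init_cov=1000.0):
        self.lam = forgetting      # forgetting factor, 0 < lam <= 1 (assumed value)
        self.theta = [1.0, 0.0]    # [a, b]; initial guess "hold the last value"
        self.P = [[init_cov, 0.0], [0.0, init_cov]]
        self.history = []          # up to two most recent samples, oldest first

    def update(self, x):
        """Incorporate a new measurement x(t)."""
        if len(self.history) == 2:
            phi = [self.history[1], self.history[0]]  # [x(t-1), x(t-2)]
            P = self.P
            Pphi = [P[0][0] * phi[0] + P[0][1] * phi[1],
                    P[1][0] * phi[0] + P[1][1] * phi[1]]
            denom = self.lam + phi[0] * Pphi[0] + phi[1] * Pphi[1]
            gain = [Pphi[0] / denom, Pphi[1] / denom]
            err = x - (self.theta[0] * phi[0] + self.theta[1] * phi[1])
            self.theta = [self.theta[0] + gain[0] * err,
                          self.theta[1] + gain[1] * err]
            # P <- (P - gain * phi^T P) / lam; phi^T P equals Pphi since P is symmetric.
            self.P = [[(P[i][j] - gain[i] * Pphi[j]) / self.lam for j in range(2)]
                      for i in range(2)]
        self.history = (self.history + [x])[-2:]

    def predict(self):
        """Predict the next value from the two most recent samples."""
        if len(self.history) < 2:
            return self.history[-1] if self.history else None
        a, b = self.theta
        return a * self.history[1] + b * self.history[0]


# A parameter moving at constant velocity satisfies x(t) = 2*x(t-1) - x(t-2).
predictor = SecondOrderPredictor()
for t in range(20):
    predictor.update(float(t))
prediction = predictor.predict()
```

One predictor of this kind would be kept per tracked parameter (each position coordinate, the orientation, and the shape parameters), and the forgetting factor lets the estimates of $a$ and $b$ follow changes in the motion.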

Once the object positions in the next frame are estimated, we can find the regions of
occlusion. The decision which of the two overlapping objects is in front can be made using the
estimated 3D positions (see Section 2.5) or prior knowledge. If we predict that at a certain
pixel one object will overlap the other and if in the next image frame we really find evidence
that the pixel was generated by the first object, then we assign the estimated probability of
the first object's appearance (10) not only to the object itself, but also to the overlapped
object. Since we have less confidence that the second object really projects onto this pixel,
we assign only half of the actually estimated probability to the second object. If the
probability that the second object projects onto this pixel is greater than the probability of
the supposedly overlapping object, then this probability is retained.

We can reason about occlusions only for objects that are actually being tracked. We cannot say
much when one of the tracked objects is occluded by something that our system cannot perceive.
This would require more complicated shape analysis that we want to avoid. Our approach is based
on the prediction of future positions and can thus work only when the predictions are reliable.
We have obtained good results when using a high-speed camera as in Fig. 1. However, the
approach becomes less reliable when tracking rapidly moving objects with standard video cameras
as in our real-time system.

2.5. 3D position estimation

We use stereo to estimate the position of tracked entities such as hand or head. We could take
the center of blobs in both images to estimate the 3D position of a tracked body part. However,
the two centers give only a rather crude stereo correspondence because the blobs found in the
two images do not cover the same areas on the body (see Fig. 2). This happens because of
differences in the viewing direction and because of uncertainties in the estimation of shape.

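Cross-correlation is a standard method for the calculation of stereo correspondences. In our
case, we are not interested in generating a full depth map, but only to estimate a 3D blob
position. We take the center of the blob in the left image as a starting point. A box template
around the blob center is extracted and we attempt to find the best match in the right image
using zero-mean normalized cross-correlation

\[ \mathrm{ZNCC}(I_{\mathbf{u},l}, I_{\mathbf{u}+\mathbf{d},r}) = \frac{\mathrm{cov}(I_{\mathbf{u},l}, I_{\mathbf{u}+\mathbf{d},r})}{\sqrt{\mathrm{var}(I_{\mathbf{u},l})} \sqrt{\mathrm{var}(I_{\mathbf{u}+\mathbf{d},r})}}, \tag{20} \]

where

\[ \mathrm{cov}(I_{\mathbf{u},l}, I_{\mathbf{u}+\mathbf{d},r}) = \frac{\sum_{\mathbf{v}} (I_{\mathbf{u}+\mathbf{v},l} - \bar{I}_{\mathbf{u},l})^T (I_{\mathbf{u}+\mathbf{d}+\mathbf{v},r} - \bar{I}_{\mathbf{u}+\mathbf{d},r})}{n - 1}, \qquad
   \mathrm{var}(I_{\mathbf{u}}) = \frac{\sum_{\mathbf{v}} \| I_{\mathbf{u}+\mathbf{v}} - \bar{I}_{\mathbf{u}} \|^2}{n - 1}, \]

the sums run over the $n$ pixel offsets $\mathbf{v}$ of the box, and $\bar{I}_{\mathbf{u}}$ is
the mean color within the box around pixel $\mathbf{u}$. The maximum of correlation (20) is
sought for in a region defined by a slice of the image along the epipolar line that lies within
the right blob.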
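
The correlation search can be illustrated with the following simplified Python sketch. It makes several assumptions that are not part of the described system: scalar intensities are used instead of color vectors, the images are rectified so that the epipolar line is an image row, and the template is one-dimensional. The function names `zncc` and `best_disparity` are hypothetical.

```python
import math


def zncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(patch_a, patch_b))
    var_a = sum((a - mean_a) ** 2 for a in patch_a)
    var_b = sum((b - mean_b) ** 2 for b in patch_b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # flat patch: correlation undefined, treated as no match
    # The common 1/(n - 1) factors of covariance and variances cancel.
    return cov / math.sqrt(var_a * var_b)


def best_disparity(left_row, right_row, center, half_width, disparities):
    """Return (disparity, score) maximizing ZNCC along one image row."""
    template = left_row[center - half_width:center + half_width + 1]
    best = None
    for d in disparities:
        start = center + d - half_width
        if start < 0 or start + len(template) > len(right_row):
            continue  # candidate box falls outside the right image
        score = zncc(template, right_row[start:start + len(template)])
        if best is None or score > best[1]:
            best = (d, score)
    return best


# The right row is the left row shifted by two pixels, with gain and offset changed.
left_row = [0, 0, 1, 5, 9, 5, 1, 0, 0, 0, 0, 0]
right_row = [2 * v + 3 for v in left_row[2:] + [0, 0]]
disparity, score = best_disparity(left_row, right_row, center=4,
                                  half_width=1, disparities=range(-3, 4))
```

Because the patch means are subtracted and the result is normalized by the standard deviations, the match is unaffected by the brightness gain and offset between the two cameras in this example.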



3. Experiment 1: Mimicking of hand motion

The first example on which we show the usefulness of the developed vision system is mimicking
of human hand motion. The task of the humanoid is to move its palm along the same type of path
as the human demonstrator. Fig. 3 shows an example of mimicking a circular motion.

DB first records the demonstrator's initial hand position as detected by its vision system. At
the same time, DB's hand position is also recorded. As the demonstrator starts moving his hand,
DB determines the demonstrator's hand motion relative to the initial hand position and
generates points in its workspace that result in the same path relative to DB's initial hand
position. These 3D points are transmitted to DB's control system in real time as the desired
hand positions. The Cartesian hand positions are transformed, again in real time, into DB's
joint angles using an inverse kinematics method described in [13]. The resulting joint angle
positions are followed by DB's controller.
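
The relative mapping described above amounts to adding the demonstrator's displacement to the robot's initial hand position. A minimal Python sketch, assuming both positions are expressed in frames with parallel axes and equal scale (an assumption made here for illustration), with the hypothetical function name `mimic_target`:

```python
def mimic_target(demo_pos, demo_init, robot_init):
    """Desired robot hand position that reproduces the demonstrator's displacement.

    demo_pos   -- current 3D hand position of the demonstrator
    demo_init  -- demonstrator's hand position when mimicking started
    robot_init -- robot's hand position when mimicking started
    """
    return tuple(r0 + (p - p0)
                 for p, p0, r0 in zip(demo_pos, demo_init, robot_init))


# The demonstrator moved by (0, 1, 2) from the start; the robot target moves equally.
target = mimic_target((1.0, 2.0, 3.0), (1.0, 1.0, 1.0), (0.5, 0.0, 0.25))
```

The resulting Cartesian targets would then be passed to the inverse kinematics step.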

Fig. 2. The detected right human hand overlaid by the estimated blob as seen by left and right eye.

The main problem with this approach is that 3D point positions are very noisy both because of
the uncertainties in our vision system and because of vibrations caused by DB's motion. To
alleviate this problem, we developed a method in which DB does not start moving along the
demonstrated path immediately but first gathers 1 second of data (30 points on the demonstrated
path). A smooth trajectory is then generated using least-squares approximation with B-splines.
The applied B-spline basis should have significantly fewer basis functions than there are data
points. In the example in Fig. 4, we used only six basis functions for 1 second of data. In the
next second, DB starts moving its hand along the generated spline trajectory while continuing
to monitor the demonstrator's motion. This process is repeated every second and continuity
constraints are enforced at the edges. Such an approach seems to be natural because humans also
first try to figure out what is going on before they start mimicking other people's motion.
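
The smoothing step can be sketched in Python as follows, for one coordinate of the hand trajectory. The sketch assumes cubic B-splines on a clamped, uniform knot vector and solves the normal equations directly; it does not enforce the continuity constraints between consecutive one-second segments. All function names and the synthetic noisy data are hypothetical.

```python
import math


def bspline_basis(i, k, knots, t):
    """Cox-de Boor recursion: value of the i-th B-spline of degree k at t."""
    if k == 0:
        if knots[i] <= t < knots[i + 1]:
            return 1.0
        # Close the last non-empty interval on the right so t == knots[-1] is covered.
        if t == knots[-1] and knots[i] < knots[i + 1] and knots[i + 1] == knots[-1]:
            return 1.0
        return 0.0
    value = 0.0
    if knots[i + k] > knots[i]:
        value += ((t - knots[i]) / (knots[i + k] - knots[i])
                  * bspline_basis(i, k - 1, knots, t))
    if knots[i + k + 1] > knots[i + 1]:
        value += ((knots[i + k + 1] - t) / (knots[i + k + 1] - knots[i + 1])
                  * bspline_basis(i + 1, k - 1, knots, t))
    return value


def solve_linear(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def fit_bspline(ts, ys, knots, degree=3):
    """Least-squares B-spline coefficients for samples (ts, ys)."""
    n_basis = len(knots) - degree - 1
    A = [[bspline_basis(i, degree, knots, t) for i in range(n_basis)] for t in ts]
    M = [[sum(row[i] * row[j] for row in A) for j in range(n_basis)]
         for i in range(n_basis)]
    rhs = [sum(row[i] * y for row, y in zip(A, ys)) for i in range(n_basis)]
    return solve_linear(M, rhs)


def evaluate_bspline(coeffs, knots, t, degree=3):
    """Value of the fitted spline at t."""
    return sum(c * bspline_basis(i, degree, knots, t) for i, c in enumerate(coeffs))


# One second of data at 30 Hz, six cubic basis functions on a clamped knot vector.
knots = [0.0] * 4 + [1.0 / 3.0, 2.0 / 3.0] + [1.0] * 4
ts = [j / 29.0 for j in range(30)]
noisy = [math.sin(2.0 * math.pi * t) + (0.05 if j % 2 else -0.05)
         for j, t in enumerate(ts)]
coeffs = fit_bspline(ts, noisy, knots)
smooth = [evaluate_bspline(coeffs, knots, t) for t in ts]
```

With 30 samples and only six coefficients the fit cannot follow sample-to-sample jitter, which is the reason for choosing far fewer basis functions than data points.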

Another important issue is the choice of arm configuration. A humanoid arm is redundant with
respect to the mimicking of hand motion and there is an infinite number of configurations that
result in the same hand motion. In the experiment in Fig. 3, the robot arm configuration was
different from the demonstrator's arm configuration when the mimicking began and this
difference was retained throughout the mimicking session. Ideally, the visual system should
recognize the initial arm configuration, but this is a formidable task for visual processing,
especially if it is to be performed in real time. We are currently working on this problem.

Mimicking arm motion with a humanoid has been done before. The approach in [4] is also based on
the tracking of hand motion, but the authors map the 3D Cartesian motion into the joint space
using some predefined rules and without using any kinematic information. While this can be
effective in some cases, it is rather arbitrary and cannot account for all possible motions. In
addition, their vision system is based on a more standard processing of visual information.

Fig. 4. Least-squares approximation of the generated trajectory with B-splines: (solid line) the smoothed trajectories; (dashed line) the
original trajectories.


4. Experiment 2: Smooth pursuit 

Very accurate oculomotor control is required in order to track a moving target in the view
field of the narrow angle foveal cameras. Primates can precisely follow a target moving, e.g.,
at constant velocity or in sinusoidal motion despite significant processing delays in the
visual pathway. Furthermore, their oculomotor control sometimes becomes predictive. This cannot
be achieved by simple visual feedback control.

In our system, smooth pursuit oculomotor control is achieved by a biologically inspired and
control theoretically sound controller consisting of two cascaded learning modules. The goal of
the first one is to learn an inverse model of the oculomotor system, while the other tries to
learn the dynamics of a visual target to predict the current target velocity in the head
coordinates. Although both modules work together to minimize the tracking error in the image,
smooth pursuit is usually conducted with the assumption that the inverse model was learned
beforehand in order to guarantee that the predictor can learn the proper visual target
dynamics. We employ LWPR (locally weighted projection regression) for the integrated learning
of both modules [16]. It enables on-line learning of higher-order linear/nonlinear dynamics.

Fig. 5 shows the operation of our vision system combined with the described on-line learning
algorithm. The accuracy of the approach is demonstrated in Fig. 6. The final rectified mean
error reached less than 0.05 rad, which is very small even for our foveal vision. As expected,
learning the dynamics of head motion was more difficult than learning the dynamics of an ideal
pendulum that we used in our initial experiments, but the learning algorithm was nevertheless
successful. Because of the initial values of learning parameters, it did happen occasionally
that the learning algorithm produced wrong models, which resulted in wild eye motions and
temporary loss of the visual target. It turned out that our vision system is reliable enough to
recover from such errors so that the learning system was able to start learning again after
failure. A more detailed description of this algorithm can be found in [12].

Fig. 5. Smooth pursuit with DB's left eye. First row shows the action taken by the external camera, second row shows successful face
tracking while the camera is moving (the face is overlaid with the detected blob), third row shows how DB's left eye is moving, and fourth
row shows the view from all four cameras demonstrating that the person's head stays within the view field of the left narrow-view camera.






A. Ude et al. / Robotics and Autonomous Systems 37 (2001) 115-125




Fig. 6. Top: time course of the estimated target position angle (dotted line) and the eye angular position (solid line). Bottom: time course
of the rectified mean retinal error. Horizontal axis: time [s].



5. Summary and conclusion 

We have presented a real-time visual system that enables a humanoid robot to interact with
humans. Viewing the motion tracking and estimation problem in a probabilistic setting allowed
us to avoid the excessive simplicity of real-time systems based on some form of thresholding
and ensured that the system is not too brittle with respect to the setting of initial
parameters. In fact, there are only a few parameters that need to be set in our system. In
addition, the developed algorithm is simple enough that a real-time implementation was
possible. Compared to other probabilistic real-time systems, most notably among them the
"camshift" algorithm which is included in the publicly available OpenCV library [10], our
system considers not only the distribution of color but also the spatial distribution of
pixels. This becomes especially important when tracking multiple objects and when it is
necessary to reason about occlusions.

We performed numerous experiments with the proposed approach and used it for the generation of
several simple behaviors such as smooth pursuit and on-line mimicking of human hand motion.
Overall, the system proved to work reliably when used in complex environments such as the ones
in Figs. 1-3 and 5, and to be reasonably insensitive to moderate variations in the lighting
conditions.